We study generalized additive partial linear models, proposing
the use of polynomial spline smoothing for estimation of nonpara- 
metric functions, and deriving quasi-likelihood based estimators for 
the linear parameters. We establish asymptotic normality for the es- 
timators of the parametric components. The procedure avoids solving 
large systems of equations as in kernel-based procedures and thus re- 
sults in gains in computational simplicity. We further develop a class 
of variable selection procedures for the linear parameters by employ- 
ing a nonconcave penalized quasi-likelihood, which is shown to have 
an asymptotic oracle property. Monte Carlo simulations and an em- 
pirical example are presented for illustration. 

1. Introduction. Generalized linear models (GLM), introduced by Nelder 
and Wedderburn (1972) and systematically summarized by McCullagh and 
Nelder (1989), are a powerful tool to analyze the relationship between a dis- 
crete response variable and covariates. Given a link function, the GLM ex- 
presses the relationship between the dependent and independent variables 
through a linear functional form. However, the GLM and associated methods 
may not be flexible enough when analyzing complicated data generated from 
biological and biomedical research. The generalized additive model (GAM), 
a generalization of the GLM that replaces linear components by a sum of 
smooth unknown functions of predictor variables, has been proposed as an
alternative and has been used widely [Hastie and Tibshirani (1990), Wood 
(2006)]. The generalized additive partially linear model (GAPLM) is a realis- 
tic, parsimonious candidate when one believes that the relationship between 
the dependent variable and some of the covariates has a parametric form, 
while the relationship between the dependent variable and the remaining 
covariates may not be linear. GAPLM enjoys the simplicity of the GLM 
and the flexibility of the GAM because it combines both parametric and 
nonparametric components. 

There are two possible approaches for estimating the parametric compo- 
nent and the nonparametric components in a GAPLM. The first is a com- 
bination of kernel-based backfitting and local scoring, proposed by Buja, 
Hastie and Tibshirani (1989) and detailed by Hastie and Tibshirani (1990). 
This method may need to solve a large system of equations [Yu, Park and 
Mammen (2008)], and also makes it difficult to introduce a penalized func- 
tion for variable selection as given in Section 4. The second is an application 
of the marginal integration approach [Linton and Nielsen (1995)] to the non- 
parametric component of the generalized partial linear models. They treated 
the summand of additive terms as a nonparametric component, which is then 
estimated as a multivariate nonparametric function. This strategy may still 
suffer from the "curse of dimensionality" when the number of additive terms 
is not small [Härdle et al. (2004)].

The kernel-based backfitting and marginal integration approaches are 
computationally expensive. Marx and Eilers (1998), Ruppert, Wand and 
Carroll (2003) and Wood (2004) studied penalized regression splines, which 
share most of the practical benefits of smoothing spline methods, combined 
with ease of use and reduction of the computational cost of backfitting 
GAMs. Widely used R/Splus packages gam and mgcv provide a convenient 
implementation in practice. However, no theoretical justifications are avail- 
able for these procedures in the additive case. See Li and Ruppert (2008) 
for recent work in the one-dimensional case. 

In this paper, we will use polynomial splines to estimate the nonpara- 
metric components. Besides asymptotic theory, we develop a flexible and 
convenient estimation procedure for GAPLM. The use of polynomial spline 
smoothing in generalized nonparametric models goes back to Stone (1986), 
who first obtained the rate of convergence of the polynomial spline estimates 
for the generalized additive model. Stone (1994) and Huang (1998) investi- 
gated polynomial spline estimation for the generalized functional ANOVA 
model. More recently, Xue and Yang (2006) studied estimation of the ad- 
ditive coefficient model for a continuous response variable using polynomial 
spline methods. Our models emphasize possibly non-Gaussian responses, 
and combine both parametric and nonparametric components through a link 
function. Estimation is achieved through maximizing the quasi-likelihood 
with polynomial spline smoothing for the nonparametric functions. The convergence
results of the maximum likelihood estimates for the nonparametric
parts in this article are similar to those for regression established by Xue and 
Yang (2006). However, it is very challenging to establish asymptotic normal- 
ity in our general context, since it cannot be viewed simply as an orthogonal 
projection, due to its nonlinear structure. To the best of our knowledge, 
this is the first attempt to establish asymptotic normality of the estimators 
for the parametric components in GAPLM. Moreover, polynomial spline 
smoothing is a global smoothing method, which approximates the unknown 
functions via polynomial splines characterized by a linear combination of 
spline basis. After the spline basis is chosen, the coefficients can be esti- 
mated by an efficient one-step procedure of maximizing the quasi-likelihood 
function. In contrast, kernel-based methods, such as those reviewed above, in 
which the maximization must be conducted repeatedly at every data point or 
a grid of values, are more time-consuming. Thus, the application of polyno- 
mial spline smoothing in the current context is particularly computationally 
efficient compared to some of its counterparts. 

In practice, a large number of variables may be collected and some of 
the insignificant ones should be excluded before forming a final model. It 
is an important issue to select significant variables for both parametric and 
nonparametric regression models; see Fan and Li (2006) for a comprehensive 
overview of variable selection. Traditional variable selection procedures such 
as stepwise deletion and subset selection may be extended to the GAPLM. 
However, these are also computationally expensive because, for each sub- 
model, we encounter the challenges mentioned above. 

To select significant variables in semiparametric models, Li and Liang 
(2008) adopted Fan and Li's (2001) variable selection procedures for parame- 
tric models via nonconcave penalized quasi-likelihood, but their models do 
not cover the GAPLM. Of course, before developing justifiable variable selec- 
tion for the GAPLM, it is important to establish asymptotic properties for 
the parametric components. In this article, we propose a class of variable se- 
lection procedures for the parametric component of the GAPLM and study 
the asymptotic properties of the resulting estimator. We demonstrate how 
the rate of convergence of the resulting estimate depends on the regulariza- 
tion parameters, and further show that the penalized quasi-likelihood esti- 
mators perform asymptotically as an oracle procedure for selecting the model. 

The rest of the article is organized as follows. In Section 2, we introduce 
the GAPLM model. In Section 3, we propose polynomial spline estimators 
via a quasi-likelihood approach, and study the asymptotic properties of the 
proposed estimators. In Section 4, we describe the variable selection procedu- 
res for the parametric component, and then prove their statistical properties. 
Simulation studies and an empirical example are presented in Section 5. 
Regularity conditions and the proofs of the main results are presented in 
the Appendix. 

2. The models. Let $Y$ be the response variable, and let
$\mathbf{X} = (X_1, \ldots, X_{d_1})^T \in R^{d_1}$ and $\mathbf{Z} = (Z_1, \ldots, Z_{d_2})^T \in R^{d_2}$
be the covariates. We assume the conditional density of $Y$ given
$(\mathbf{X}, \mathbf{Z}) = (\mathbf{x}, \mathbf{z})$ belongs to the exponential family

(1) $f_{Y|\mathbf{X},\mathbf{Z}}(y|\mathbf{x},\mathbf{z}) = \exp[y\xi(\mathbf{x},\mathbf{z}) - B\{\xi(\mathbf{x},\mathbf{z})\} + C(y)]$

for known functions $B$ and $C$, where $\xi$, the so-called natural parameter
in parametric generalized linear models (GLM), is related to the unknown
mean response by

$\mu(\mathbf{x},\mathbf{z}) = E(Y|\mathbf{X}=\mathbf{x}, \mathbf{Z}=\mathbf{z}) = B'\{\xi(\mathbf{x},\mathbf{z})\}.$

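In parametric GLM, the mean function $\mu$ is defined via a known link function
$g$ by $g\{\mu(\mathbf{x},\mathbf{z})\} = \mathbf{x}^T\boldsymbol{\alpha} + \mathbf{z}^T\boldsymbol{\beta}$, where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are parametric vectors
to be estimated. In this article, $g(\mu)$ is modeled as an additive partial linear
function

(2) $g\{\mu(\mathbf{x},\mathbf{z})\} = \sum_{k=1}^{d_1} \eta_k(x_k) + \mathbf{z}^T\boldsymbol{\beta},$

where $\boldsymbol{\beta}$ is a $d_2$-dimensional regression parameter, $\{\eta_k\}_{k=1}^{d_1}$ are unknown and
smooth functions and $E\{\eta_k(X_k)\} = 0$ for $1 \le k \le d_1$ for identifiability.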

If the conditional variance function $\mathrm{var}(Y|\mathbf{X}=\mathbf{x}, \mathbf{Z}=\mathbf{z}) = \sigma^2 V\{\mu(\mathbf{x},\mathbf{z})\}$
for some known positive function $V$, then estimation of the mean can be
achieved by replacing the conditional loglikelihood function
$\log\{f_{Y|\mathbf{X},\mathbf{Z}}(y|\mathbf{x},\mathbf{z})\}$ in (1) by a quasi-likelihood function $Q(m,y)$, which satisfies

$\frac{\partial}{\partial m} Q(m,y) = \frac{y - m}{V(m)}.$
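
As a concrete check of this defining relation, the sketch below takes the Bernoulli case, where $V(m) = m(1-m)$ and the quasi-likelihood coincides with the log-likelihood $Q(m,y) = y\log m + (1-y)\log(1-m)$. The function names are illustrative assumptions, not part of the original method; the code only compares a finite-difference derivative of $Q$ with $(y-m)/V(m)$.

```python
import math

def V(m):
    # Bernoulli variance function V(m) = m(1 - m); an illustrative choice.
    return m * (1.0 - m)

def Q(m, y):
    # Bernoulli quasi-likelihood, which coincides with the log-likelihood.
    return y * math.log(m) + (1.0 - y) * math.log(1.0 - m)

def quasi_score(m, y):
    # Right-hand side of the defining relation: (y - m) / V(m).
    return (y - m) / V(m)

def numeric_dQ(m, y, h=1e-6):
    # Central finite-difference approximation of dQ/dm.
    return (Q(m + h, y) - Q(m - h, y)) / (2.0 * h)

if __name__ == "__main__":
    for y in (0.0, 1.0):
        for m in (0.2, 0.5, 0.9):
            print(y, m, quasi_score(m, y), numeric_dQ(m, y))
```

Any other variance function could be substituted for `V`, provided `Q` is changed to match.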

The first goal of this article is to provide a simple method of estimating $\boldsymbol{\beta}$
and $\{\eta_k\}_{k=1}^{d_1}$ in model (2) based on a quasi-likelihood procedure [Severini
and Staniswalis (1994)] with polynomial splines. The second goal is to discuss
how to select significant parametric variables in this semiparametric
framework.



3. Estimation method. 



3.1. Maximum quasi-likelihood. Let $(Y_i, \mathbf{X}_i, \mathbf{Z}_i)$, $i = 1, \ldots, n$, be independent
copies of $(Y, \mathbf{X}, \mathbf{Z})$. To avoid confusion, let $\eta_0 = \sum_{k=1}^{d_1}\eta_{0k}(x_k)$ and $\boldsymbol{\beta}_0$
be the true additive function and the true parameter values, respectively.
For simplicity, we assume that the covariate $X_k$ is distributed on a compact
interval $[a_k, b_k]$, $k = 1, \ldots, d_1$, and without loss of generality, we take
all intervals $[a_k, b_k] = [0,1]$, $k = 1, \ldots, d_1$. Under smoothness assumptions,
the $\eta_{0k}$'s can be well approximated by spline functions. Let $S_n$ be the space
of polynomial splines on $[0,1]$ of order $r \ge 1$. We introduce a knot sequence
with $J$ interior knots

$\xi_{-r+1} = \cdots = \xi_{-1} = \xi_0 = 0 < \xi_1 < \cdots < \xi_J < 1 = \xi_{J+1} = \cdots = \xi_{J+r},$

where $J = J_n$ increases when the sample size $n$ increases; the precise order
is given in condition (C5) in Section 3.2. According to Stone (1985), $S_n$
consists of functions $h$ satisfying:

(i) $h$ is a polynomial of degree $r - 1$ on each of the subintervals
$I_j = [\xi_j, \xi_{j+1})$, $j = 0, \ldots, J_n - 1$, $I_{J_n} = [\xi_{J_n}, 1]$;

(ii) for $r \ge 2$, $h$ is $r - 2$ times continuously differentiable on $[0,1]$.

Equally spaced knots are used in this article for simplicity of proof. However,
other regular knot sequences can also be used, with similar asymptotic
results.
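
The spline space described above can be made concrete with the Cox-de Boor recursion. The following is a minimal sketch, assuming equally spaced interior knots on $[0,1]$ and boundary knots repeated $r$ times; `knot_sequence` and `bspline_basis` are hypothetical helper names introduced for illustration. It returns the values of the $J + r$ B-spline basis functions of order $r$ at a point.

```python
def knot_sequence(J, r):
    # r copies of 0, J equally spaced interior knots, r copies of 1.
    interior = [(j + 1) / (J + 1) for j in range(J)]
    return [0.0] * r + interior + [1.0] * r

def bspline_basis(x, J, r):
    """Values of the J + r B-spline basis functions of order r at x in [0, 1]."""
    t = knot_sequence(J, r)
    # Order 1: indicators of the half-open knot intervals; the last nonempty
    # interval is closed on the right so that x = 1 is covered.
    B = [0.0] * (len(t) - 1)
    for j in range(len(t) - 1):
        if t[j] <= x < t[j + 1] or (x == 1.0 and t[j] < t[j + 1] == 1.0):
            B[j] = 1.0
    # Cox-de Boor recursion from order 2 up to order r.
    for k in range(2, r + 1):
        new = [0.0] * (len(t) - k)
        for j in range(len(t) - k):
            left = right = 0.0
            if t[j + k - 1] > t[j]:
                left = (x - t[j]) / (t[j + k - 1] - t[j]) * B[j]
            if t[j + k] > t[j + 1]:
                right = (t[j + k] - x) / (t[j + k] - t[j + 1]) * B[j + 1]
            new[j] = left + right
        B = new
    return B

if __name__ == "__main__":
    # Cubic splines (order 4) with 3 interior knots give 7 basis functions.
    values = bspline_basis(0.37, J=3, r=4)
    print(len(values), sum(values))
```

The basis values are nonnegative and sum to one at every point of $[0,1]$, which is why a centering constraint is needed once several additive components are combined.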

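We will consider additive spline estimates $\hat{\eta}$ of $\eta_0$. Let $\mathcal{G}_n$ be the collection
of functions $\eta$ with the additive form $\eta(\mathbf{x}) = \sum_{k=1}^{d_1}\eta_k(x_k)$, where each component
function $\eta_k \in S_n$ and $\sum_{i=1}^{n}\eta_k(X_{ik}) = 0$. We seek a function $\eta \in \mathcal{G}_n$
and a value of $\boldsymbol{\beta}$ that maximize the quasi-likelihood function

(3) $L(\eta, \boldsymbol{\beta}) = n^{-1}\sum_{i=1}^{n} Q[g^{-1}\{\eta(\mathbf{X}_i) + \mathbf{Z}_i^T\boldsymbol{\beta}\}, Y_i].$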

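For the $k$th covariate $x_k$, let $b_{j,k}(x_k)$ be the B-spline basis functions of
order $r$. For any $\eta \in \mathcal{G}_n$, write $\eta(\mathbf{x}) = \boldsymbol{\gamma}^T\mathbf{b}(\mathbf{x})$, where
$\mathbf{b}(\mathbf{x}) = \{b_{j,k}(x_k), j = 1, \ldots, J_n + r, k = 1, \ldots, d_1\}^T$ is the collection of the spline basis functions,
and $\boldsymbol{\gamma} = \{\gamma_{j,k}, j = 1, \ldots, J_n + r, k = 1, \ldots, d_1\}^T$ is the spline coefficient vector.
Thus, the maximization problem in (3) is equivalent to finding $\boldsymbol{\beta}$ and $\boldsymbol{\gamma}$
to maximize

(4) $\ell(\boldsymbol{\gamma}, \boldsymbol{\beta}) = n^{-1}\sum_{i=1}^{n} Q[g^{-1}\{\boldsymbol{\gamma}^T\mathbf{b}(\mathbf{X}_i) + \mathbf{Z}_i^T\boldsymbol{\beta}\}, Y_i].$

We denote the maximizer as $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\gamma}} = \{\hat{\gamma}_{j,k}, j = 1, \ldots, J_n + r, k = 1, \ldots, d_1\}^T$.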

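Then the spline estimator of $\eta_0$ is $\hat{\eta}(\mathbf{x}) = \hat{\boldsymbol{\gamma}}^T\mathbf{b}(\mathbf{x})$, and the centered spline
component function estimators are

$\hat{\eta}_k(x_k) = \sum_{j=1}^{J_n+r}\hat{\gamma}_{j,k}b_{j,k}(x_k) - n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{J_n+r}\hat{\gamma}_{j,k}b_{j,k}(X_{ik}), \qquad k = 1, \ldots, d_1.$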

The above estimation approach can be easily implemented because this ap- 
proximation results in a generalized linear model. However, theoretical jus- 
tification for this estimation approach is very challenging [Huang (1998)]. 
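
The reduction to a generalized linear model noted above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Bernoulli response with logit link, one nonparametric covariate and one linear covariate, and it maximizes the quasi-likelihood by plain Newton-Raphson. The helpers `bspline_basis`, `solve` and `fit_logistic`, the simulated data and the tiny `ridge` term are all assumptions made for the example.

```python
import math
import random

def bspline_basis(x, J, r):
    # Order-r B-spline basis on [0, 1] with J equally spaced interior knots
    # (Cox-de Boor recursion); returns J + r values.
    t = [0.0] * r + [(j + 1) / (J + 1) for j in range(J)] + [1.0] * r
    B = [1.0 if (t[j] <= x < t[j + 1]) or (x == 1.0 and t[j] < t[j + 1] == 1.0) else 0.0
         for j in range(len(t) - 1)]
    for k in range(2, r + 1):
        new = []
        for j in range(len(t) - k):
            left = (x - t[j]) / (t[j + k - 1] - t[j]) * B[j] if t[j + k - 1] > t[j] else 0.0
            right = (t[j + k] - x) / (t[j + k] - t[j + 1]) * B[j + 1] if t[j + k] > t[j + 1] else 0.0
            new.append(left + right)
        B = new
    return B

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system A x = b.
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[piv] = M[piv], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_logistic(D, y, iters=25, ridge=1e-8):
    # Newton-Raphson for the Bernoulli quasi-likelihood with logit link.
    # `ridge` is a tiny diagonal term added to the Hessian for stability only.
    p = len(D[0])
    theta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[ridge if a == b else 0.0 for b in range(p)] for a in range(p)]
        for row, yi in zip(D, y):
            lin = sum(t * d for t, d in zip(theta, row))
            lin = max(-30.0, min(30.0, lin))  # guard against overflow in exp
            mu = 1.0 / (1.0 + math.exp(-lin))
            w = mu * (1.0 - mu)
            for a in range(p):
                grad[a] += (yi - mu) * row[a]
                for b in range(p):
                    hess[a][b] += w * row[a] * row[b]
        theta = [t + s for t, s in zip(theta, solve(hess, grad))]
    return theta

if __name__ == "__main__":
    random.seed(0)
    n, J, r, beta_true = 400, 2, 4, 1.5  # illustrative settings
    X = [random.random() for _ in range(n)]
    Z = [random.gauss(0.0, 1.0) for _ in range(n)]
    Y = []
    for x, z in zip(X, Z):
        m = math.sin(2.0 * math.pi * x) + beta_true * z
        Y.append(1.0 if random.random() < 1.0 / (1.0 + math.exp(-m)) else 0.0)
    # Design matrix: spline basis columns followed by the linear covariate.
    D = [bspline_basis(x, J, r) + [z] for x, z in zip(X, Z)]
    theta = fit_logistic(D, Y)
    gamma_hat, beta_hat = theta[:-1], theta[-1]
    # Centered component estimate: subtract the sample mean of the fitted spline.
    fitted = [sum(g * b for g, b in zip(gamma_hat, bspline_basis(x, J, r))) for x in X]
    center = sum(fitted) / n
    print("beta_hat:", round(beta_hat, 3), "centering constant:", round(center, 3))
```

Because the B-spline basis of a single covariate already spans the constants, no separate intercept column is included in this sketch.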
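
Let $N_n = J_n + r - 1$. We adopt the normalized B-spline space introduced
in Xue and Yang (2006) with the following normalized basis

(5) $B_{j,k}(x_k) = \sqrt{N_n}\left\{b_{j+1,k}(x_k) - \frac{E(b_{j+1,k})}{E(b_{1,k})}b_{1,k}(x_k)\right\}, \qquad 1 \le j \le N_n, 1 \le k \le d_1,$

which is convenient for asymptotic analysis. Let
$\mathbf{B}(\mathbf{x}) = \{B_{j,k}(x_k), j = 1, \ldots, N_n, k = 1, \ldots, d_1\}^T$ and $\mathbf{B}_i = \mathbf{B}(\mathbf{X}_i)$.
Finding $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ that maximizes (4) is
mathematically equivalent to finding $(\boldsymbol{\gamma}, \boldsymbol{\beta})$ which maximizes

$\ell(\boldsymbol{\gamma}, \boldsymbol{\beta}) = n^{-1}\sum_{i=1}^{n} Q[g^{-1}\{\mathbf{B}_i^T\boldsymbol{\gamma} + \mathbf{Z}_i^T\boldsymbol{\beta}\}, Y_i].$

Then the spline estimator of $\eta_0$ is $\hat{\eta}(\mathbf{x}) = \hat{\boldsymbol{\gamma}}^T\mathbf{B}(\mathbf{x})$, and the centered spline
estimators of the component functions are

$\hat{\eta}_k(x_k) = \sum_{j=1}^{N_n}\hat{\gamma}_{j,k}B_{j,k}(x_k) - n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{N_n}\hat{\gamma}_{j,k}B_{j,k}(X_{ik}), \qquad k = 1, \ldots, d_1.$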

We show next that estimators of both the parametric and nonparametric 
components have nice asymptotic properties. 

3.2. Assumptions and asymptotic results. Let $v$ be a positive integer and
$\alpha \in (0,1]$ such that $p = v + \alpha > 2$. Let $\mathcal{H}(p)$ be the collection of functions $g$
on $[0,1]$ whose $v$th derivative, $g^{(v)}$, exists and satisfies a Lipschitz condition
of order $\alpha$, $|g^{(v)}(m^*) - g^{(v)}(m)| \le C|m^* - m|^{\alpha}$, for $0 \le m^*, m \le 1$, where $C$
is a positive constant. Following the notation of Carroll et al. (1997), let
$\rho_\ell(m) = \{dg^{-1}(m)/dm\}^{\ell}/V\{g^{-1}(m)\}$ and $q_\ell(m,y) = \partial^{\ell}/\partial m^{\ell}\, Q\{g^{-1}(m), y\}$,
so that

$q_1(m,y) = \partial/\partial m\, Q\{g^{-1}(m), y\} = \{y - g^{-1}(m)\}\rho_1(m),$

$q_2(m,y) = \partial^2/\partial m^2\, Q\{g^{-1}(m), y\} = \{y - g^{-1}(m)\}\rho_1'(m) - \rho_2(m).$

For simplicity of notation, write $\mathbf{T} = (\mathbf{X}, \mathbf{Z})$ and $\mathbf{A}^{\otimes 2} = \mathbf{A}\mathbf{A}^T$ for any
matrix or vector $\mathbf{A}$. We make the following assumptions:

(C1) The function $\eta_0(\cdot)$ is continuous and each component function
$\eta_{0k}(\cdot) \in \mathcal{H}(p)$, $k = 1, \ldots, d_1$.

(C2) The function $q_2(m,y) < 0$ and $c_q < |q_2^{\nu}(m,y)| < C_q$ ($\nu = 0, 1$) for
$m \in R$ and $y$ in the range of the response variable.

(C3) The distribution of $\mathbf{X}$ is absolutely continuous and its density $f$ is
bounded away from zero and infinity on $[0,1]^{d_1}$.

(C4) The random vector $\mathbf{Z}$ satisfies that for any unit vector $\boldsymbol{\omega} \in R^{d_2}$

$c \le \boldsymbol{\omega}^T E(\mathbf{Z}^{\otimes 2}|\mathbf{X} = \mathbf{x})\boldsymbol{\omega} \le C.$

(C5) The number of knots satisfies $n^{1/(2p)} \ll N_n \ll n^{1/4}$.

Remark 1. The smoothness condition in (C1) describes a requirement
on the best rate of convergence that the functions $\eta_{0k}(\cdot)$'s can be approximated
by functions in the spline spaces. Condition (C2) is imposed to ensure
the uniqueness of the solution; see, for example, Condition 1a of Carroll
et al. (1997) and Condition (i) of Li and Liang (2008). Condition (C3) requires
a boundedness condition on the covariates, which is often assumed in
asymptotic analysis of nonparametric regression problems; see Condition 1
of Stone (1985), Assumption (B3)(ii) of Huang (1999) and Assumption (C1)
of Xue and Yang (2006). The boundedness assumption on the support can
be replaced by a finite third moment assumption, but this will add much
extra complexity to the proofs. Condition (C4) implies that the eigenvalues
of $E(\mathbf{Z}^{\otimes 2}|\mathbf{X} = \mathbf{x})$ are bounded away from 0 and $\infty$. Condition (C5) gives the
rate of growth of the dimension of the spline spaces relative to the sample
size.

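For measurable functions $\varphi_1$, $\varphi_2$ on $[0,1]^{d_1}$, define the empirical inner
product and the corresponding norm as

$\langle\varphi_1, \varphi_2\rangle_n = n^{-1}\sum_{i=1}^{n}\{\varphi_1(\mathbf{X}_i)\varphi_2(\mathbf{X}_i)\}, \qquad \|\varphi\|_n^2 = n^{-1}\sum_{i=1}^{n}\varphi^2(\mathbf{X}_i).$

If $\varphi_1$ and $\varphi_2$ are $L^2$-integrable, define the theoretical inner product and
corresponding norm as

$\langle\varphi_1, \varphi_2\rangle = E\{\varphi_1(\mathbf{X})\varphi_2(\mathbf{X})\}, \qquad \|\varphi\|_2^2 = E\varphi^2(\mathbf{X}).$

Let $\|\varphi\|_{nk}^2$ and $\|\varphi\|_{2k}^2$ be the empirical and theoretical norm of $\varphi$ on $[0,1]$,
defined by

$\|\varphi\|_{nk}^2 = n^{-1}\sum_{i=1}^{n}\varphi^2(X_{ik}), \qquad \|\varphi\|_{2k}^2 = E\varphi^2(X_k) = \int_0^1 \varphi^2(x_k)f_k(x_k)\,dx_k,$

where $f_k(\cdot)$ is the density function of $X_k$.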

Theorem 1 describes the rates of convergence of the nonparametric parts.

Theorem 1. Under conditions (C1)-(C5), for $k = 1, \ldots, d_1$,
$\|\hat{\eta} - \eta_0\|_2 = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$;
$\|\hat{\eta} - \eta_0\|_n = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$;
$\|\hat{\eta}_k - \eta_{0k}\|_{2k} = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$ and
$\|\hat{\eta}_k - \eta_{0k}\|_{nk} = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$.
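
To see how the two terms in the rate of Theorem 1 trade off, the sketch below evaluates them separately. The approximation term $N_n^{1/2-p}$ decreases in $N_n$ while the estimation term $(N_n/n)^{1/2}$ increases, and the two are equal when $N_n = n^{1/(2p)}$. The function names are hypothetical, and constants hidden in the $O_P$ notation are ignored.

```python
def rate_terms(N, n, p):
    # Approximation error N^{1/2 - p} and estimation error (N / n)^{1/2}.
    approx = N ** (0.5 - p)
    estim = (N / n) ** 0.5
    return approx, estim

def balancing_knots(n, p):
    # Value of N at which the two terms coincide: N = n^{1 / (2p)}.
    return n ** (1.0 / (2.0 * p))

if __name__ == "__main__":
    n, p = 400, 3.0  # illustrative values
    for N in (2, 3, 4):
        print(N, rate_terms(N, n, p))
    print("balancing N:", balancing_knots(n, p))
```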

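Let $m_0(\mathbf{T}) = \eta_0(\mathbf{X}) + \mathbf{Z}^T\boldsymbol{\beta}_0$ and define

(6) $\boldsymbol{\Gamma}(\mathbf{x}) = \frac{E[\mathbf{Z}\rho_2\{m_0(\mathbf{T})\}|\mathbf{X} = \mathbf{x}]}{E[\rho_2\{m_0(\mathbf{T})\}|\mathbf{X} = \mathbf{x}]}, \qquad \tilde{\mathbf{Z}} = \mathbf{Z} - \boldsymbol{\Gamma}^{add}(\mathbf{X}),$

where

(7) $\boldsymbol{\Gamma}^{add}(\mathbf{x}) = \sum_{k=1}^{d_1}\boldsymbol{\Gamma}_k(x_k)$

is the projection of $\boldsymbol{\Gamma}$ onto the Hilbert space of theoretically centered additive
functions with a norm $\|f\|^2_{\rho_2,m_0} = E[f(\mathbf{X})^2\rho_2\{m_0(\mathbf{T})\}]$. To obtain asymptotic
normality of the estimators in the linear part, we further impose the
conditions:

(C6) The additive components in (7) satisfy that $\boldsymbol{\Gamma}_k(\cdot) \in \mathcal{H}(p)$, $k = 1, \ldots, d_1$.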
(C7) For $\rho_\ell$, we have

$|\rho_\ell(m_0)| \le C_\rho$ and $|\rho_\ell(m) - \rho_\ell(m_0)| \le C_\rho^*|m - m_0|$
for all $|m - m_0| \le C_m$, $\ell = 1, 2$.

(C8) There exists a positive constant $C_0$, such that
$E[\{Y - g^{-1}(m_0(\mathbf{T}))\}^2|\mathbf{T}] \le C_0$, almost surely.

The next theorem shows that the maximum quasi-likelihood estimator
of $\boldsymbol{\beta}_0$ is root-$n$ consistent and asymptotically normal, although the convergence
rate of the nonparametric component $\eta_0$ is of course slower than
root-$n$.

Theorem 2. Under conditions (C1)-(C8), $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0) \to \mathrm{Normal}(\mathbf{0}, \boldsymbol{\Omega}^{-1})$,
where $\boldsymbol{\Omega} = E[\rho_2\{m_0(\mathbf{T})\}\tilde{\mathbf{Z}}^{\otimes 2}]$.

The proofs of these theorems are given in the Appendix.

It is worthwhile pointing out that taking the additive structure of the nuisance
parameter into account leads to a smaller asymptotic variance than
that of the estimators which ignore the additivity [Yu and Lee (2010)]. Carroll
et al. (2009) had the same observation for a special case with repeated
measurement data when $g$ is the identity function.

4. Selection of significant parametric variables. In this section, we develop
variable selection procedures for the parametric component of the
GAPLM. We study the asymptotic properties of the resulting estimator,
illustrate how the rate of convergence of the resulting estimate depends on
the regularization parameters, and further establish the oracle properties of
the resulting estimate.

4.1. Penalized likelihood. Building upon the quasi-likelihood given in (3),
we define the penalized quasi-likelihood as

(8) $\mathcal{L}_P(\eta, \boldsymbol{\beta}) = \sum_{i=1}^{n} Q[g^{-1}\{\eta(\mathbf{X}_i) + \mathbf{Z}_i^T\boldsymbol{\beta}\}, Y_i] - n\sum_{j=1}^{d_2} p_{\lambda_j}(|\beta_j|),$

where $p_{\lambda_j}(\cdot)$ is a prespecified penalty function with a regularization parameter
$\lambda_j$. The penalty functions and regularization parameters in (8) are not
necessarily the same for all $j$. For example, we may wish to keep scientifically
important variables in the final model, and therefore do not want to penalize
their coefficients. In practice, $\lambda_j$ can be chosen by a data-driven criterion,
such as cross-validation (CV) or generalized cross-validation [GCV, Craven
and Wahba (1979)].

Various penalty functions have been used in variable selection for linear
regression models, for instance, the $L_0$ penalty, in which
$p_{\lambda_j}(|\beta|) = 0.5\lambda_j^2 I(|\beta| \ne 0)$. The traditional best-subset variable selection can be viewed
as a penalized least squares with the $L_0$ penalty because $\sum_{j=1}^{d_2} I(|\beta_j| \ne 0)$
is essentially the number of nonzero regression coefficients in the model.
Of course, this procedure has two well known and severe problems. First,
when the number of covariates is large, it is computationally infeasible to
do subset selection. Second, best subset variable selection suffers from high
variability and instability [Breiman (1996), Fan and Li (2001)].

The Lasso is a regularization technique for simultaneous estimation and
variable selection [Tibshirani (1996), Zou (2006)] that avoids the drawbacks
of the best subset selection. It can be viewed as a penalized least
squares estimator with the $L_1$ penalty, defined by $p_{\lambda_j}(|\beta|) = \lambda_j|\beta|$. Frank
and Friedman (1993) considered bridge regression with an $L_q$ penalty, in
which $p_{\lambda_j}(|\beta|) = \lambda_j|\beta|^q$ ($0 < q < 1$). The issue of selection of the penalty
function has been studied in depth by a variety of authors. For example,
Fan and Li (2001) suggested using the SCAD penalty, defined by

$p'_{\lambda_j}(\beta) = \lambda_j\left\{I(\beta \le \lambda_j) + \frac{(a\lambda_j - \beta)_+}{(a - 1)\lambda_j}I(\beta > \lambda_j)\right\}$ for some $a > 2$ and $\beta > 0$,

where $p_{\lambda_j}(0) = 0$, and $\lambda_j$ and $a$ are two tuning parameters. Fan and Li (2001)
suggested using $a = 3.7$, which will be used in Section 5.
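
Since the SCAD penalty is specified through its derivative, it can help to see the penalty itself. The sketch below integrates the derivative in closed form, using $p_\lambda(0) = 0$; the function names `scad_derivative` and `scad_penalty` are illustrative, and the default $a = 3.7$ follows the value quoted above.

```python
def scad_derivative(beta, lam, a=3.7):
    # p'_lambda(beta) for beta > 0: equal to lam on (0, lam], then decreasing
    # linearly to zero at a * lam, and zero beyond.
    if beta <= lam:
        return lam
    return max(a * lam - beta, 0.0) / (a - 1.0)

def scad_penalty(beta, lam, a=3.7):
    # Closed-form integral of the derivative above, with p_lambda(0) = 0.
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2.0 * a * lam * b - b * b - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0

if __name__ == "__main__":
    for b in (0.5, 1.0, 2.0, 3.7, 10.0):
        print(b, scad_derivative(b, 1.0), scad_penalty(b, 1.0))
```

The penalty grows like the $L_1$ penalty near the origin and is constant beyond $a\lambda$, so large coefficients are not shrunk.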

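Substituting $\eta$ by its estimate in (8), we obtain a penalized likelihood

(9) $\mathcal{L}_P(\boldsymbol{\beta}) = \sum_{i=1}^{n} Q[g^{-1}\{\mathbf{B}_i^T\hat{\boldsymbol{\gamma}} + \mathbf{Z}_i^T\boldsymbol{\beta}\}, Y_i] - n\sum_{j=1}^{d_2} p_{\lambda_j}(|\beta_j|).$

Maximizing $\mathcal{L}_P(\boldsymbol{\beta})$ in (9) yields a maximum penalized likelihood estimator
$\hat{\boldsymbol{\beta}}^{MPL}$. The theorems established below demonstrate that $\hat{\boldsymbol{\beta}}^{MPL}$ performs
asymptotically as well as an oracle estimator.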

4.2. Sampling properties. We next show that with a proper choice of $\lambda_j$,
the maximum penalized likelihood estimator $\hat{\boldsymbol{\beta}}^{MPL}$ has an asymptotic oracle
property. Let $\boldsymbol{\beta}_0 = (\beta_{10}, \ldots, \beta_{d_2 0})^T = (\boldsymbol{\beta}_{10}^T, \boldsymbol{\beta}_{20}^T)^T$, where $\boldsymbol{\beta}_{10}$ is assumed to
consist of all nonzero components of $\boldsymbol{\beta}_0$ and $\boldsymbol{\beta}_{20} = \mathbf{0}$ without loss of generality.
Similarly we write $\mathbf{Z} = (\mathbf{Z}_1^T, \mathbf{Z}_2^T)^T$. Denote
$w_n = \max_{1 \le j \le d_2}\{|p''_{\lambda_j}(|\beta_{j0}|)|, \beta_{j0} \ne 0\}$ and

(10) $a_n = \max_{1 \le j \le d_2}\{|p'_{\lambda_j}(|\beta_{j0}|)|, \beta_{j0} \ne 0\}.$



Theorem 3. Under the regularity conditions given in Section 3.2, and
if $a_n \to 0$ and $w_n \to 0$ as $n \to \infty$, then there exists a local maximizer $\hat{\boldsymbol{\beta}}^{MPL}$
of $\mathcal{L}_P(\boldsymbol{\beta})$ defined in (9) such that its rate of convergence is $O_P(n^{-1/2} + a_n)$,
where $a_n$ is given in (10).

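Next, define $\boldsymbol{\xi}_n = \{p'_{\lambda_1}(|\beta_{10}|)\,\mathrm{sgn}(\beta_{10}), \ldots, p'_{\lambda_s}(|\beta_{s0}|)\,\mathrm{sgn}(\beta_{s0})\}^T$ and a diagonal
matrix $\boldsymbol{\Sigma}_\lambda = \mathrm{diag}\{p''_{\lambda_1}(|\beta_{10}|), \ldots, p''_{\lambda_s}(|\beta_{s0}|)\}$, where $s$ is the number
of nonzero components of $\boldsymbol{\beta}_0$. Define $\mathbf{T}_1 = (\mathbf{X}, \mathbf{Z}_1)$ and
$m_0(\mathbf{T}_1) = \eta_0(\mathbf{X}) + \mathbf{Z}_1^T\boldsymbol{\beta}_{10}$, and further let

$\boldsymbol{\Gamma}_1(\mathbf{x}) = \frac{E[\mathbf{Z}_1\rho_2\{m_0(\mathbf{T}_1)\}|\mathbf{X} = \mathbf{x}]}{E[\rho_2\{m_0(\mathbf{T}_1)\}|\mathbf{X} = \mathbf{x}]}, \qquad \tilde{\mathbf{Z}}_1 = \mathbf{Z}_1 - \boldsymbol{\Gamma}_1^{add}(\mathbf{X}),$

where $\boldsymbol{\Gamma}_1^{add}$ is the projection of $\boldsymbol{\Gamma}_1$ onto the Hilbert space of theoretically
centered additive functions with the norm $\|f\|_{\rho_2,m_0}$.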

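Theorem 4. Suppose that the regularity conditions given in Section 3.2
hold, and that $\liminf_{n\to\infty}\liminf_{\beta_j\to 0^+}\lambda_{jn}^{-1}p'_{\lambda_{jn}}(|\beta_j|) > 0$. If $\sqrt{n}\lambda_{jn} \to \infty$ as
$n \to \infty$, then the root-$n$ consistent estimator $\hat{\boldsymbol{\beta}}^{MPL}$ in Theorem 3 satisfies
$\hat{\boldsymbol{\beta}}_2^{MPL} = \mathbf{0}$, and
$\sqrt{n}(\boldsymbol{\Omega}_s + \boldsymbol{\Sigma}_\lambda)\{\hat{\boldsymbol{\beta}}_1^{MPL} - \boldsymbol{\beta}_{10} + (\boldsymbol{\Omega}_s + \boldsymbol{\Sigma}_\lambda)^{-1}\boldsymbol{\xi}_n\} \to \mathrm{Normal}(\mathbf{0}, \boldsymbol{\Omega}_s)$,
where $\boldsymbol{\Omega}_s = E[\rho_2\{m_0(\mathbf{T}_1)\}\tilde{\mathbf{Z}}_1^{\otimes 2}]$.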

4.3. Implementation. As pointed out by Li and Liang (2008), many
penalty functions, including the $L_1$ penalty and the SCAD penalty, are irregular
at the origin and may not have a second derivative at some points.
Thus, it is often difficult to implement the Newton-Raphson algorithm directly.
As in Fan and Li (2001) and Hunter and Li (2005), we approximate the
penalty function locally by a quadratic function at every step in the iteration
such that the Newton-Raphson algorithm can be modified for finding
the solution of the penalized likelihood. Specifically, given an initial
value $\boldsymbol{\beta}^{(0)}$ that is close to the maximizer of the penalized likelihood function,
the penalty $p_{\lambda_j}(|\beta_j|)$ can be locally approximated by the quadratic function
as $\{p_{\lambda_j}(|\beta_j|)\}' = p'_{\lambda_j}(|\beta_j|)\,\mathrm{sgn}(\beta_j) \approx \{p'_{\lambda_j}(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}\beta_j$, when $\beta_j^{(0)}$ is
not very close to 0; otherwise, set $\hat{\beta}_j = 0$. In other words, for $\beta_j \approx \beta_j^{(0)}$,

$p_{\lambda_j}(|\beta_j|) \approx p_{\lambda_j}(|\beta_j^{(0)}|) + (1/2)\{p'_{\lambda_j}(|\beta_j^{(0)}|)/|\beta_j^{(0)}|\}(\beta_j^2 - \beta_j^{(0)2}).$

For instance, this local quadratic approximation for the $L_1$ penalty yields

$|\beta_j| \approx (1/2)|\beta_j^{(0)}| + (1/2)\beta_j^2/|\beta_j^{(0)}|$ for $\beta_j \approx \beta_j^{(0)}$.
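
The mechanics of the local quadratic approximation can be illustrated on a deliberately simple problem. The sketch below is an assumption-laden toy, not the penalized quasi-likelihood of this paper: it minimizes the scalar objective $0.5(\beta - z)^2 + \lambda|\beta|$, replaces $\lambda|\beta|$ by its quadratic surrogate at each step, and compares the limit with the known soft-thresholding solution. The names `lqa_l1_scalar` and `soft_threshold` are hypothetical.

```python
import math

def soft_threshold(z, lam):
    # Exact minimizer of 0.5 * (beta - z)^2 + lam * |beta|; reference only.
    return math.copysign(max(abs(z) - lam, 0.0), z)

def lqa_l1_scalar(z, lam, iters=200, tol=1e-10):
    # Local quadratic approximation for the L1 penalty on a scalar problem.
    beta = z  # initial value
    for _ in range(iters):
        if abs(beta) < tol:
            # Iterate is very close to 0: set the estimate to exactly 0.
            return 0.0
        # Surrogate: lam * |beta| ~ const + 0.5 * (lam / |beta_old|) * beta^2,
        # so the update solves (beta - z) + (lam / |beta_old|) * beta = 0.
        beta = z / (1.0 + lam / abs(beta))
    return beta

if __name__ == "__main__":
    for z in (3.0, -3.0, 0.5):
        print(z, lqa_l1_scalar(z, 1.0), soft_threshold(z, 1.0))
```

The thresholding step mirrors the rule in the text of setting a coefficient to zero once its current value is very close to 0.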

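Standard error formula for $\hat{\boldsymbol{\beta}}^{MPL}$. We follow the approach in Li and Liang
(2008) to derive a sandwich formula for the estimator $\hat{\boldsymbol{\beta}}^{MPL}$. Let

$\ell'(\boldsymbol{\beta}) = \frac{\partial\ell(\hat{\boldsymbol{\gamma}}, \boldsymbol{\beta})}{\partial\boldsymbol{\beta}}, \qquad \ell''(\boldsymbol{\beta}) = \frac{\partial^2\ell(\hat{\boldsymbol{\gamma}}, \boldsymbol{\beta})}{\partial\boldsymbol{\beta}\,\partial\boldsymbol{\beta}^T};$

$\boldsymbol{\Sigma}_\lambda(\boldsymbol{\beta}) = \mathrm{diag}\left\{\frac{p'_{\lambda_1}(|\beta_1|)}{|\beta_1|}, \ldots, \frac{p'_{\lambda_{d_2}}(|\beta_{d_2}|)}{|\beta_{d_2}|}\right\}.$

A sandwich formula is given by

$\widehat{\mathrm{cov}}(\hat{\boldsymbol{\beta}}^{MPL}) = \{n\ell''(\hat{\boldsymbol{\beta}}^{MPL}) - n\boldsymbol{\Sigma}_\lambda(\hat{\boldsymbol{\beta}}^{MPL})\}^{-1}\,\widehat{\mathrm{cov}}\{\ell'(\hat{\boldsymbol{\beta}}^{MPL})\}\,\{n\ell''(\hat{\boldsymbol{\beta}}^{MPL}) - n\boldsymbol{\Sigma}_\lambda(\hat{\boldsymbol{\beta}}^{MPL})\}^{-1}.$

Following conventional techniques that arise in the likelihood setting, the
above sandwich formula can be shown to be a consistent estimator and will
be shown in our simulation study to have good accuracy for moderate sample
sizes.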

Choice of $\lambda_j$'s. The unknown parameters ($\lambda_j$) can be selected using data-driven
approaches, for example, generalized cross validation as proposed in
Fan and Li (2001). Replacing $\boldsymbol{\beta}$ in (4) with its estimate $\hat{\boldsymbol{\beta}}^{MPL}$, we maximize
$\ell(\boldsymbol{\gamma}, \hat{\boldsymbol{\beta}}^{MPL})$ with respect to $\boldsymbol{\gamma}$. The solution is denoted by $\hat{\boldsymbol{\gamma}}^{MPL}$, and the
corresponding estimator of $\eta_0$ is defined as

(11) $\hat{\eta}^{MPL}(\mathbf{x}) = (\hat{\boldsymbol{\gamma}}^{MPL})^T\mathbf{B}(\mathbf{x}).$

Here the GCV statistic is defined by

$\mathrm{GCV}(\lambda_1, \ldots, \lambda_{d_2}) = \frac{\sum_{i=1}^{n} D[Y_i, g^{-1}\{\hat{\eta}^{MPL}(\mathbf{X}_i) + \mathbf{Z}_i^T\hat{\boldsymbol{\beta}}^{MPL}\}]}{n\{1 - e(\lambda_1, \ldots, \lambda_{d_2})/n\}^2},$

where $e(\lambda_1, \ldots, \lambda_{d_2}) = \mathrm{tr}[\{\ell''(\hat{\boldsymbol{\beta}}^{MPL}) - n\boldsymbol{\Sigma}_\lambda(\hat{\boldsymbol{\beta}}^{MPL})\}^{-1}\ell''(\hat{\boldsymbol{\beta}}^{MPL})]$ is the effective
number of parameters and $D(Y, \hat{\mu})$ is the deviance of $Y$ corresponding
to fitting with $\lambda$. The minimization problem over a $d_2$-dimensional space
is difficult. However, Li and Liang (2008) conjectured that the magnitude
of $\lambda_j$ should be proportional to the standard error of the unpenalized maximum
pseudo-partial likelihood estimator of $\beta_j$. Thus, we suggest taking
$\lambda_j = \lambda\,\mathrm{SE}(\hat{\beta}_j)$ in practice, where $\mathrm{SE}(\hat{\beta}_j)$ is the estimated standard error
of $\hat{\beta}_j$, the unpenalized likelihood estimate defined in Section 3. Then the
minimization problem can be reduced to a one-dimensional problem, and
the tuning parameter can be estimated by a grid search.
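
The one-dimensional grid search can be sketched as follows. This is only an outline under stated assumptions: `fit` is a hypothetical user-supplied function that, given the vector of $\lambda_j$ values, refits the penalized model and returns the total deviance and the effective number of parameters; `gcv` and `select_lambda` are illustrative names.

```python
def gcv(deviance, eff_params, n):
    # Deviance divided by n * {1 - e / n}^2.
    return deviance / (n * (1.0 - eff_params / n) ** 2)

def select_lambda(grid, se, fit, n):
    # Grid search over a common multiplier lam, with lambda_j = lam * SE_j.
    # `fit(lams)` is assumed to return (deviance, effective number of parameters).
    best = None
    for lam in grid:
        lams = [lam * s for s in se]
        deviance, eff_params = fit(lams)
        score = gcv(deviance, eff_params, n)
        if best is None or score < best[1]:
            best = (lam, score)
    return best

if __name__ == "__main__":
    # Toy stand-in for a real fitting routine, used only to exercise the search.
    def toy_fit(lams):
        return (lams[0] - 0.2) ** 2 + 1.0, 2.0

    print(select_lambda([0.05, 0.1, 0.2, 0.4], [2.0, 1.0], toy_fit, n=100))
```

In a real analysis the stub would be replaced by a routine that maximizes (9) for the given penalties and evaluates the deviance at the fitted values.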

5. Numerical studies. 

5.1. A simulation study. We simulated 100 data sets consisting of n = 
100, 200 and 400 observations, respectively, from the GAPLM: 

(12) $\mathrm{logit}\{\mathrm{pr}(Y = 1)\} = \eta_1(X_1) + \eta_2(X_2) + \mathbf{Z}^T\boldsymbol{\beta},$

where

$\eta_1(x) = \sin(4\pi x),$

$\eta_2(x) = 10\{\exp(-3.25x) + 4\exp(-6.5x) + 3\exp(-9.75x)\}$

and the true parameters $\boldsymbol{\beta} = (3, 1.5, 0, 0, 0, 0, 2, 0)^T$. $X_1$ and $X_2$ are independently
uniformly distributed on $[0,1]$. $Z_1$ and $Z_2$ are normally distributed
with mean 0.5 and variance 0.09. The random vector $(Z_1, \ldots, Z_6, X_1, X_2)$
has an autoregressive structure with correlation coefficient $\rho = 0.5$.
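
A data-generating sketch for model (12) is given below. It is a simplification, and the simplifications are assumptions of this example rather than the design used in the study: all entries of $\mathbf{Z}$ are drawn as independent normals with mean 0.5 and variance 0.09, the autoregressive correlation is not reproduced, and the component functions are used exactly as written, without centering. The names `eta1`, `eta2`, `BETA` and `simulate` are illustrative.

```python
import math
import random

def eta1(x):
    return math.sin(4.0 * math.pi * x)

def eta2(x):
    return 10.0 * (math.exp(-3.25 * x) + 4.0 * math.exp(-6.5 * x)
                   + 3.0 * math.exp(-9.75 * x))

BETA = [3.0, 1.5, 0.0, 0.0, 0.0, 0.0, 2.0, 0.0]  # true parameters from the text

def simulate(n, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        # Simplifying assumption: independent N(0.5, 0.09) entries for Z; the
        # autoregressive correlation described in the text is NOT reproduced.
        z = [rng.gauss(0.5, 0.3) for _ in BETA]
        m = eta1(x1) + eta2(x2) + sum(b * zj for b, zj in zip(BETA, z))
        prob = 1.0 / (1.0 + math.exp(-m))
        y = 1 if rng.random() < prob else 0
        data.append((y, x1, x2, z))
    return data

if __name__ == "__main__":
    sample = simulate(100)
    print(sum(row[0] for row in sample), "positive responses out of", len(sample))
```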

In order to determine the number of knots in the approximation, we per- 
formed a simulation with 1,000 runs for each sample size. In each run, we fit, 
without any variable selection procedure, all possible spline approximations 
with 0-7 internal knots for each nonparametric component. The internal 
knots were equally spaced quantiles from the simulated data. We recorded 
the combination of the numbers of knots used by the best approximation, 
which had the smallest prediction error (PE), defined as 

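(13) $\mathrm{PE} = \frac{1}{n}\sum_{i=1}^{n}\{\mathrm{logit}^{-1}(\mathbf{B}_i^T\hat{\boldsymbol{\gamma}} + \mathbf{Z}_i^T\hat{\boldsymbol{\beta}}) - \mathrm{logit}^{-1}(\eta(\mathbf{X}_i) + \mathbf{Z}_i^T\boldsymbol{\beta})\}^2.$

The combinations (2,2) and (5,3) were most frequently chosen for sample sizes
100 and 400, respectively. These combinations were used in the simulations for
the variable selection procedures.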

The proposed selection procedures were applied to this model and B- 
splines were used to approximate the two nonparametric functions. In the 
simulation and also the empirical example in Section 5.2, the estimates from 
ordinary logistic regression were used as the starting values in the fitting 
procedure. 

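To study model fit, we also defined model error (ME) for the parametric
part by

(14) $\mathrm{ME}(\hat{\boldsymbol{\beta}}) = (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T E(\mathbf{Z}\mathbf{Z}^T)(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}).$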

Table 1
Results from the simulation study in Section 5.1. C, I and MRME stand for the average
number of the five zero coefficients correctly set to 0, the average number of the three
nonzero coefficients incorrectly set to 0, and the median of the relative model errors. The
model errors are defined in (14)

  n     Method    C       I       MRME
  100   ORACLE    5       0       0.27
        SCAD      4.29    0.93    0.60
        Lasso     3.83    0.67    0.51
        BIC       4.53    0.95    0.54
  400   ORACLE    5       0       0.33
        SCAD      4.81    0.27    0.49
        Lasso     3.89    0.10    0.67
        BIC       4.90    0.35    0.46

The relative model error is defined as the ratio of the model error between 
the fitted model using variable selection methods and using ordinary logistic 
regression. 
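
The model error in (14) and the relative model error are simple quadratic-form computations. The sketch below assumes that $E(\mathbf{Z}\mathbf{Z}^T)$ is available as a nested list (in a simulation it could be replaced by a sample average); `model_error` and `relative_model_error` are hypothetical names.

```python
def model_error(beta_hat, beta, EZZ):
    # Quadratic form (beta_hat - beta)^T E(Z Z^T) (beta_hat - beta).
    d = [bh - b for bh, b in zip(beta_hat, beta)]
    return sum(d[i] * EZZ[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))

def relative_model_error(me_selected, me_unpenalized):
    # Ratio of the model error after variable selection to that of the
    # fit without selection.
    return me_selected / me_unpenalized

if __name__ == "__main__":
    EZZ = [[2.0, 0.5], [0.5, 1.0]]  # illustrative second-moment matrix
    me_sel = model_error([1.1, 0.0], [1.0, 0.0], EZZ)
    me_full = model_error([1.2, 0.1], [1.0, 0.0], EZZ)
    print(relative_model_error(me_sel, me_full))
```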

The simulation results are reported in Table 1, in which the columns 
labeled with "C" give the average number of the five zero coefficients cor- 
rectly set to 0, the columns labeled with "I" give the average number of the 
three nonzero coefficients incorrectly set to 0, and the columns labeled with 
"MRME" give the median of the relative model errors. 

Summarizing Table 1, we conclude that BIC performs the best in terms
of correctly identifying zero coefficients, followed by SCAD and LASSO. On
the other hand, BIC is also more likely to set nonzero coefficients to zero,
followed by SCAD and LASSO. This indicates that BIC most aggressively
reduces the model complexity, while LASSO tends to include more variables
in the models. SCAD is a useful compromise between these two procedures.
As the sample size increases, both SCAD and BIC perform nearly as
if they had the oracle property. The MRME values of the three procedures are
comparable. Results of the cases not depicted here have characteristics similar
to those shown in Table 1. Readers may refer to the online supplemental
materials.

We also performed a simulation with correlated covariates. We generated
the response $Y$ from model (12) again but with $\boldsymbol{\beta} = (3.00, 1.50, 2.00)^T$. The
covariates $Z_1$, $Z_2$, $X_1$ and $X_2$ were marginally normal with mean zero and
variance 0.09. In order, $(Z_1, Z_2, X_1, X_2)$ had autoregressive correlation coefficient
$\rho$, while $Z_3$ is Bernoulli with success probability 0.5. We considered
two scenarios: (i) moderately correlated covariates ($\rho = 0.5$) and (ii) highly
correlated ($\rho = 0.7$) covariates. We did 1,000 simulation runs for each case
with sample sizes $n = 100$, 200 and 400. From our simulation, we observe that
the estimator becomes more unstable when the correlation among covariates
is higher. In scenario (i), all simulation runs converged. However, there were
6, 3 and 7 cases of nonconvergence over the 1,000 simulation runs for sample
sizes 100, 200 and 400, respectively, in scenario (ii). In addition, the variance
and bias of the fitted functions in scenario (ii) were much larger than those
in scenario (i), especially on the boundaries of the covariates' support. This
can be observed in Figures 1 and 2, which present the mean, absolute value
of bias and variance of the fitted nonparametric functions for $\rho = 0.5$ and
$\rho = 0.7$ with sample size $n = 100$. Similar results are obtained for sample
sizes $n = 200$ and 400, but are not given here.

Fig. 1. The mean, absolute value of the bias and variance of the fitted nonparametric
functions when $n = 100$ and $\rho = 0.5$ [the left panel for $\eta_1(x_1)$ and the right for $\eta_2(x_2)$].
95% CB stands for the 95% confidence band.

5.2. An empirical example. We now apply the GAPLM and our variable
selection procedure to a data set from the Pima Indian diabetes study
[Smith et al. (1988)]. This data set is obtained from the UCI Repository of
Machine Learning Databases, and is selected from a larger data set held by
the National Institute of Diabetes and Digestive and Kidney Diseases. All
patients in this database are Pima Indian women at least 21 years old and
living near Phoenix, Arizona. The response $Y$ is the indicator of a positive
test for diabetes. Independent variables from this data set include: NumPreg,
the number of pregnancies; DBP, diastolic blood pressure (mmHg); DPF,
diabetes pedigree function; PGC, the plasma glucose concentration after
two hours in an oral glucose tolerance test; BMI, body mass index [weight
in kg/(height in m)$^2$]; and AGE (years). There are in total 724 complete
observations in this data set.

Fig. 2. The mean, absolute value of the bias and variance of the fitted nonparametric
functions when $n = 100$ and $\rho = 0.7$. The left panel is for $\eta_1(x_1)$ and the right panel is
for $\eta_2(x_2)$. Here 95% CB stands for the 95% confidence band.

In this example, we explore the impact of these covariates on the probabil- 
ity of a positive test. We first fit the data set using a linear logistic regression 
model: the estimated results are listed in the left panel of Table 2. These 
results indicate that NumPreg, DPF, PGC and BMI are statistically sig- 
nificant, while DBP and AGE are not statistically significant. 
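As a quick check on how the significance statements relate to the reported z values, the sketch below recomputes two-sided normal p-values from the z column of Table 2. It is an illustration only, not the authors' code: the function name `two_sided_p` and the dictionary layout are hypothetical, and the z values are copied from the GLM panel.

```python
import math

def two_sided_p(z):
    """Two-sided normal p-value P(|N(0,1)| > |z|), via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# z values as printed in the GLM panel of Table 2
z_values = {"NumPreg": 3.527, "DBP": -1.035, "DPF": 3.135,
            "PGC": 9.763, "BMI": 5.777, "AGE": 1.723}

p_values = {name: two_sided_p(z) for name, z in z_values.items()}

for name, p in p_values.items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{name:8s} z = {z_values[name]:7.3f}  p = {p:.3g}  ({verdict} at level 0.05)")
```

Under this normal approximation the four covariates flagged as significant are NumPreg, DPF, PGC and BMI, in line with the text; for NumPreg, PGC and BMI the same formula gives p-values below 0.001.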
Table 2

Results for the Pima study. Left panel: estimated values, associated standard errors and
P-values by using GLM. Right panel: estimates and associated standard errors using the
GAPLM with the proposed variable selection procedures

                          GLM                                    GAPLM
           Est.     s.e.   z value  Pr(>|z|)   SCAD (s.e.)     LASSO (s.e.)     BIC (s.e.)
NumPreg    0.118    0.033    3.527             0 (0)            0.021 (0.019)   0 (0)
DBP       -0.009    0.009   -1.035   0.301     0 (0)           -0.006 (0.005)   0 (0)
DPF        0.961    0.306    3.135   0.002     0.958 (0.312)    0.813 (0.262)   0.958 (0.312)
PGC        0.035    0.004    9.763             0.036 (0.004)    0.034 (0.003)   0.036 (0.004)
BMI        0.091    0.016    5.777
AGE        0.017    0.01     1.723   0.085

Fig. 3. The patterns of the nonparametric functions of BMI and Age (solid lines)
with ± s.e. (shaded areas) using the R function, gam, for the Pima study.

However, a closer investigation shows that the effect of AGE and BMI 
on the logit transformation of the probability of a positive test may be 
nonlinear, see Figure 3. Thus, we employ the following GAPLM for this 
data analysis, 

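$$\operatorname{logit}\{P(Y = 1)\} = \eta_0 + \beta_1\,\mathrm{NumPreg} + \beta_2\,\mathrm{DBP} + \beta_3\,\mathrm{DPF} + \beta_4\,\mathrm{PGC} + \eta_1(\mathrm{BMI}) + \eta_2(\mathrm{AGE}). \tag{15}$$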

Using B-splines to approximate $\eta_1(\mathrm{BMI})$ and $\eta_2(\mathrm{AGE})$, we adopt 5-fold
cross-validation to select knots and find that the approximation with no
internal knots performs well for both nonparametric components.
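To make the phrase "no internal knots" concrete: a spline of degree $\varrho$ on an interval $[a, b]$ with no interior knots is a single polynomial of degree at most $\varrho$. The display below is a sketch under assumed notation (the symbols $\varrho$, $a$, $b$, $u$ and $\gamma_j$ are illustrative and do not come from the text, and clamped boundary knots are assumed), in which case the B-spline basis coincides with the Bernstein polynomials.

```latex
\eta_1(\mathrm{BMI}) \approx \sum_{j=0}^{\varrho} \gamma_{j}\, b_{j,\varrho}(u),
\qquad
b_{j,\varrho}(u) = \binom{\varrho}{j}\, u^{j} (1-u)^{\varrho-j},
\qquad
u = \frac{\mathrm{BMI} - a}{b - a} \in [0, 1].
```

The same form applies to $\eta_2(\mathrm{AGE})$, so the selected fit amounts to a low-degree polynomial in each of the two covariates, up to any centering imposed for identifiability.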

We applied the proposed variable selection procedures to the model (15), 
and the estimated coefficients and their standard errors are listed in the right 
panel of Table 2. Both SCAD and BIC suggest that DPF and PGC enter the 
model, whereas NumPreg and DBP are suggested not to enter. However, the 
LASSO suggests an inclusion of NumPreg and DBP. This may be because 
LASSO admits many variables in general, as we observed in the simulation 
studies. The nonparametric estimators of $\eta_1(\mathrm{BMI})$ and $\eta_2(\mathrm{AGE})$, which are
obtained by using the SCAD-based procedure, are similar to the solid lines
in Figure 3. It is worth pointing out that the effect of AGE on the probability of
a positive test shows a concave pattern, and women whose age is around 50
have the highest probability of developing diabetes. Importantly, the linear
logistic regression model does not reveal this significant effect.

It is interesting that the variable NumPreg is statistically insignificant 
when we fit the data using GAPLM with the proposed variable selection 
procedure, but shows a statistically significant impact when we use GLM. 
One might reasonably conjecture that this phenomenon might be due to 
model misspecification. To test this, we conducted a simulation as follows. 
We generated the response variables using the estimates and functions ob- 
tained by GAPLM with the SCAD. Then we fit a GLM for the generated 
data set. We repeated the generation and fitting procedures 5,000 times and 
found that NumPreg is identified as positively significant 67.42% of the
time at level 0.05 in the GLMs. For DBP, DPF, PGC, BMI and AGE, the
percentages of runs in which they are identified as statistically significant
at the level 0.05 are 4.52%, 90.36%, 100%, 99.98% and 56.58%, respectively. This
means that NumPreg can incorrectly enter the model, with more than 65%
probability, when a wrong model is used, while DBP, DPF, PGC, BMI and
AGE seem to be correctly classified as insignificant and significant covariates
even with this wrong GLM model.
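The logic of this check can be illustrated on synthetic data. The sketch below is not the authors' simulation and does not use the Pima data: it assumes a toy truth in which the response depends on a covariate x only through a concave curve h(x), while a second covariate z has a true coefficient of zero but is correlated with h(x). All names (`h`, `wald_z`, `rejection_rates`) and all numerical settings are hypothetical. It then counts how often a Wald test at level 0.05 flags z, first in a logistic GLM that is linear in x and then in a model that includes the correct curve.

```python
import math
import random

def solve(a, b):
    """Solve the small dense system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))  # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                for k in range(c, n + 1):
                    m[r][k] -= f * m[c][k]
    return [m[i][n] / m[i][i] for i in range(n)]

def expit(t):
    """Inverse logit, with the argument clipped to avoid overflow."""
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, t))))

def wald_z(X, y, j, iters=25):
    """Fit a logistic GLM by Newton-Raphson; return the Wald z statistic of coefficient j."""
    d = len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        info = [[0.0] * d for _ in range(d)]
        for xi, yi in zip(X, y):
            p = expit(sum(b * v for b, v in zip(beta, xi)))
            for a in range(d):
                grad[a] += (yi - p) * xi[a]
                for c in range(d):
                    info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    # The information matrix is from the last iterate; after convergence this
    # differs negligibly from the one at the final estimate.
    unit = [1.0 if a == j else 0.0 for a in range(d)]
    var_j = solve(info, unit)[j]  # j-th diagonal entry of the inverse information
    return beta[j] / math.sqrt(var_j)

def h(x):
    """Concave effect peaking at x = 0.5 (a toy stand-in for a curve like the AGE effect)."""
    return 1.0 - 4.0 * (x - 0.5) ** 2

def rejection_rates(reps=100, n=200, seed=2011):
    """Share of replications in which z is flagged at level 0.05 under each model."""
    rng = random.Random(seed)
    hits_linear = hits_flexible = 0
    for _ in range(reps):
        x = [rng.random() for _ in range(n)]
        z = [h(v) + rng.gauss(0.0, 0.3) for v in x]  # z relates to y only through h(x)
        y = [1 if rng.random() < expit(-2.0 + 4.0 * h(v)) else 0 for v in x]
        linear = [[1.0, v, w] for v, w in zip(x, z)]       # misspecified: linear in x
        flexible = [[1.0, h(v), w] for v, w in zip(x, z)]  # includes the true curve
        hits_linear += abs(wald_z(linear, y, 2)) > 1.96
        hits_flexible += abs(wald_z(flexible, y, 2)) > 1.96
    return hits_linear / reps, hits_flexible / reps

rate_linear, rate_flexible = rejection_rates()
print(f"z flagged under the linear GLM:      {rate_linear:.2f}")
print(f"z flagged when h(x) is in the model: {rate_flexible:.2f}")
```

In this toy setting the linear GLM flags the null covariate far more often than the nominal 5%, whereas the model containing the correct curve stays close to the nominal level, which mirrors the explanation offered above for NumPreg.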

6. Concluding remarks. We have proposed an effective polynomial spline 
technique for the GAPLM, then developed variable selection procedures to 
identify which linear predictors should be included in the final model fitting. 
The contributions we made to the existing literature can be summarized in 
three ways: (i) the procedures are computationally efficient, theoretically re- 
liable, and intuitively appealing; (ii) the estimators of the linear components, 
which are often of primary interest, are asymptotically normal; and (iii) the 
variable selection procedure for the linear components has an asymptotic 
oracle property. We believe that our approach can be extended to the case 
of longitudinal data [Lin and Carroll (2006)], although the technical details 
are by no means straightforward. 

An important question in using GAPLM in practice is which covariates 
should be included in the linear component. We suggest proceeding as fol- 
lows. The continuous covariates are put in the nonparametric part and the 
discrete covariates in the parametric part. If the estimation results show 
that some of the continuous covariate effects can be described by certain 
parametric forms such as a linear form, either by formal testing or by vi- 
sualization, then a new model can be fit with those continuous covariate 
effects moved to the parametric part. The procedure can be iterated several 
times if needed. In this way, one can take full advantage of the flexible ex- 
ploratory analysis provided by the proposed method. However, developing 
a more efficient and automatic criterion warrants future study. It is worth
pointing out that the proposed procedure may be unstable for high-dimensional
data, and may encounter collinearity problems. Addressing these challenging
questions is part of ongoing work.

APPENDIX 

Throughout the article, let $\|\cdot\|$ be the Euclidean norm and $\|\varphi\|_\infty = \sup_m |\varphi(m)|$ be the supremum norm of a function $\varphi$ on $[0, 1]$. For any matrix $A$, denote its $L_2$ norm as $\|A\|_2 = \sup_{\|x\| \neq 0} \|Ax\|/\|x\|$, the largest eigenvalue.

A.1. Technical lemmas. In the following, let $\mathcal{F}$ be a class of measurable functions. For probability measure $Q$, the $L_2(Q)$-norm of a function $f \in \mathcal{F}$ is defined by $(\int |f|^2\,dQ)^{1/2}$. According to van der Vaart and Wellner (1996), the $\delta$-covering number $\mathcal{N}(\delta, \mathcal{F}, L_2(Q))$ is the smallest value of $\mathcal{N}$ for which there exist functions $f_1, \ldots, f_{\mathcal{N}}$, such that for each $f \in \mathcal{F}$, $\|f - f_j\| \le \delta$ for some $j \in \{1, \ldots, \mathcal{N}\}$. The $\delta$-covering number with bracketing $\mathcal{N}_{[\cdot]}(\delta, \mathcal{F}, L_2(Q))$ is the smallest value of $\mathcal{N}$ for which there exist pairs of functions $\{[f_j^L, f_j^U]\}_{j=1}^{\mathcal{N}}$ with $\|f_j^U - f_j^L\| \le \delta$, such that for each $f \in \mathcal{F}$, there is a $j \in \{1, \ldots, \mathcal{N}\}$ such that $f_j^L \le f \le f_j^U$. The $\delta$-entropy with bracketing is defined as $\log \mathcal{N}_{[\cdot]}(\delta, \mathcal{F}, L_2(Q))$.

Denote $J_{[\cdot]}(\delta, \mathcal{F}, L_2(Q)) = \int_0^\delta \sqrt{1 + \log \mathcal{N}_{[\cdot]}(\varepsilon, \mathcal{F}, L_2(Q))}\,d\varepsilon$. Let $Q_n$ be the empirical measure of $Q$. Denote $\mathbb{G}_n = \sqrt{n}(Q_n - Q)$ and $\|\mathbb{G}_n\|_{\mathcal{F}} = \sup_{f \in \mathcal{F}} |\mathbb{G}_n f|$ for any measurable class of functions $\mathcal{F}$.

We state several preliminary lemmas first, whose proofs are included 
in the supplemental materials. Lemmas 1-3 will be used to prove the re- 
maining lemmas and the main results. Lemmas 4 and 5 are used to prove 
Theorems 1-3. 

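Lemma 1 [Lemma 3.4.2 of van der Vaart and Wellner (1996)]. Let $M_0$ be a finite positive constant. Let $\mathcal{F}$ be a uniformly bounded class of measurable functions such that $Qf^2 < \delta^2$ and $\|f\|_\infty < M_0$. Then

$$E_Q^* \|\mathbb{G}_n\|_{\mathcal{F}} \le C_0\, J_{[\cdot]}(\delta, \mathcal{F}, L_2(Q)) \left\{1 + \frac{J_{[\cdot]}(\delta, \mathcal{F}, L_2(Q))}{\delta^2 \sqrt{n}}\, M_0\right\},$$

where $C_0$ is a finite constant independent of $n$.

Lemma 2 [Lemma A.2 of Huang (1999)]. For any $\delta > 0$, let
$$\Theta_n = \{\eta(x) + z^{T}\beta;\ \|\beta - \beta_0\| \le \delta,\ \eta \in G_n,\ \|\eta - \eta_0\|_2 \le \delta\}.$$
Then, for any $\varepsilon \le \delta$, $\log \mathcal{N}_{[\cdot]}(\varepsilon, \Theta_n, L_2(P)) \le c N_n \log(\delta/\varepsilon)$.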

For simplicity, let

$$D_i = (B_i^{T}, Z_i^{T})^{T}, \qquad W_n = n^{-1} \sum_{i=1}^{n} D_i D_i^{T}. \tag{16}$$

Lemma 3. Under conditions (C1)-(C5), for the above random matrix $W_n$,
there exists a positive constant $C$ such that $\|W_n^{-1}\|_2 \le C$, a.s.

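According to a result of de Boor [(2001), page 149], for any function $g \in \mathcal{H}(p)$ with $p < r - 1$, there exists a function $\tilde g \in S_n^0$, such that $\|\tilde g - g\|_\infty \le C N_n^{-p}$, where $C$ is some fixed positive constant. For $\eta_0$ satisfying (C1), we can find $\tilde\gamma = \{\tilde\gamma_{j,k},\ j = 1, \ldots, N_n,\ k = 1, \ldots, d_1\}^{T}$ and an additive spline function $\tilde\eta = \tilde\gamma^{T} B(x) \in G_n$, such that

$$\|\tilde\eta - \eta_0\|_\infty = O(N_n^{-p}). \tag{17}$$

Let

$$\tilde\beta = \arg\max_{\beta}\ n^{-1} \sum_{i=1}^{n} Q[g^{-1}\{\tilde\eta(X_i) + Z_i^{T}\beta\}, Y_i]. \tag{18}$$

In the following, let $m_{0i} \equiv m_0(T_i) = \eta_0(X_i) + Z_i^{T}\beta_0$ and $\varepsilon_i = Y_i - g^{-1}(m_{0i})$. Further let

$$\tilde m_0(t) = \tilde\eta(x) + z^{T}\beta_0, \qquad \tilde m_{0i} \equiv \tilde m_0(T_i) = \tilde\eta(X_i) + Z_i^{T}\beta_0.$$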



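Lemma 4. Under conditions (C1)-(C5), $\sqrt{n}(\tilde\beta - \beta_0) \to \mathrm{Normal}(0, A^{-1}\Sigma_1 A^{-1})$, where $\tilde\beta$ is in (18), $A = E[\rho_2\{m_0(T)\} Z^{\otimes 2}]$ and $\Sigma_1 = E[q_1^2\{m_0(T)\} Z^{\otimes 2}]$.

In the following, denote $\tilde\theta = (\tilde\gamma^{T}, \tilde\beta^{T})^{T}$, $\hat\theta = (\hat\gamma^{T}, \hat\beta^{T})^{T}$ and

$$\tilde m_i \equiv \tilde m(T_i) = \tilde\eta(X_i) + Z_i^{T}\tilde\beta = B_i^{T}\tilde\gamma + Z_i^{T}\tilde\beta. \tag{19}$$

Lemma 5. Under conditions (C1)-(C5),

$$\|\hat\theta - \tilde\theta\| = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}.$$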

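A.2. Proof of Theorem 1. According to Lemma 5,

$$\|\hat\eta - \tilde\eta\|_2^2 = \|(\hat\gamma - \tilde\gamma)^{T} B\|_2^2 = (\hat\gamma - \tilde\gamma)^{T} E\left\{n^{-1} \sum_{i=1}^{n} B(X_i)^{\otimes 2}\right\} (\hat\gamma - \tilde\gamma) \le C \|\hat\gamma - \tilde\gamma\|_2^2,$$

thus $\|\hat\eta - \tilde\eta\|_2 = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$ and

$$\|\hat\eta - \eta_0\|_2 \le \|\hat\eta - \tilde\eta\|_2 + \|\tilde\eta - \eta_0\|_2 = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\} + O_P(N_n^{-p}) = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}.$$

By Lemma 1 of Stone (1985), $\|\hat\eta_k - \eta_{0k}\|_{2k} = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$, for each $1 \le k \le d_1$. Equation (17) implies that $\|\hat\eta - \tilde\eta\|_n = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$. Then

$$\|\hat\eta - \eta_0\|_n \le \|\hat\eta - \tilde\eta\|_n + \|\tilde\eta - \eta_0\|_n = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\} + O_P(N_n^{-p}) = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}.$$

Similarly,

$$\sup |\langle \eta_1, \eta_2 \rangle_n - \langle \eta_1, \eta_2 \rangle| = O_P\{(\log(n) N_n/n)^{1/2}\}$$

and $\|\hat\eta_k - \eta_{0k}\|_{nk} = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$, for any $k = 1, \ldots, d_1$.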



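A.3. Proof of Theorem 2. We first verify that

$$n^{-1} \sum_{i=1}^{n} \rho_2(m_{0i}) \tilde Z_i \Gamma(X_i)^{T} (\hat\beta - \beta_0) = o_P(n^{-1/2}), \tag{20}$$

$$n^{-1} \sum_{i=1}^{n} \{(\hat\eta - \eta_0)(X_i)\} \rho_2(m_{0i}) \tilde Z_i = o_P(n^{-1/2}), \tag{21}$$

where $\tilde Z$ is defined in (6).
Define

$$\mathcal{M}_n = \{m(x, z) = \eta(x) + z^{T}\beta : \eta \in G_n\}. \tag{22}$$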

Noting that $\rho_2$ is a fixed bounded function under (C7), we have $E[(\eta - \eta_0)(X) \rho_2(m_0) \tilde Z_l]^2 \le O(\|m - m_0\|_2^2)$, for $l = 1, \ldots, d_2$. By Lemma 2, the logarithm of the $\varepsilon$-bracketing number of the class of functions

$$\mathcal{A}_1(\delta) = \{\rho_2\{m(x, z)\}\{z - \Gamma(x)\} : m \in \mathcal{M}_n,\ \|m - m_0\| \le \delta\}$$

is $c\{N_n \log(\delta/\varepsilon) + \log(\delta^{-1})\}$, so the corresponding entropy integral

$$J_{[\cdot]}(\delta, \mathcal{A}_1(\delta), \|\cdot\|) \le c\delta\{N_n^{1/2} + \log^{1/2}(\delta^{-1})\}.$$

According to Lemmas 4 and 5 and Theorem 1, $\|\hat m - m_0\|_2 = O_P\{N_n^{1/2-p} + (N_n/n)^{1/2}\}$. By Lemma 7 of Stone (1986), we have $\|\hat\eta - \eta_0\|_\infty \le c N_n^{1/2} \|\hat\eta - \eta_0\|_2 = O_P(N_n^{1-p} + N_n n^{-1/2})$, thus

$$\|\hat m - m_0\|_\infty = O_P(N_n^{1-p} + N_n n^{-1/2}). \tag{23}$$

$$E\left| n^{-1} \sum_{i=1}^{n} \{(\hat\eta - \eta_0)(X_i)\} \rho_2(m_{0i}) \tilde Z_i - E[(\hat\eta - \eta_0)(X) \rho_2\{m_0(T)\} \tilde Z] \right|
\le n^{-1/2} C r_n^{-1} \{N_n^{1/2} + \log^{1/2}(r_n)\} \left[1 + \frac{c\, r_n^{-1}\{N_n^{1/2} + \log^{1/2}(r_n)\}}{r_n^{-2} \sqrt{n}}\, M_0\right]
\le O(1)\, n^{-1/2} r_n^{-1} \{N_n^{1/2} + \log^{1/2}(r_n)\},$$

where $r_n^{-1}\{N_n^{1/2} + \log^{1/2}(r_n)\} = o(1)$ according to condition (C5). By the definition of $\tilde Z$, for any measurable function $\phi$, $E[\phi(X) \rho_2\{m_0(T)\} \tilde Z] = 0$. Hence (21) holds. Similarly, (20) follows from Lemmas 1 and 5.

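According to condition (C6), the projection function $\Gamma^{\mathrm{add}}(x) = \sum_{k=1}^{d_1} \Gamma_k(x_k)$, where the theoretically centered function $\Gamma_k \in \mathcal{H}(p)$. By the result of de Boor [(2001), page 149], there exists an empirically centered function $\tilde\Gamma_k \in S_n^0$, such that $\|\tilde\Gamma_k - \Gamma_k\|_\infty = O_P(N_n^{-p})$, $k = 1, \ldots, d_1$. Denote $\tilde\Gamma^{\mathrm{add}}(x) = \sum_{k=1}^{d_1} \tilde\Gamma_k(x_k)$ and clearly $\tilde\Gamma^{\mathrm{add}} \in G_n$. For any $\nu \in R^{d_2}$, define

$$\hat m_\nu = \hat m(x, z) + \nu^{T}\{z - \tilde\Gamma^{\mathrm{add}}(x)\} = \{\hat\eta(x) - \nu^{T}\tilde\Gamma^{\mathrm{add}}(x)\} + (\hat\beta + \nu)^{T} z \in \mathcal{M}_n,$$

where $\mathcal{M}_n$ is given in (22). Note that $\hat m_\nu$ maximizes the function $l_n(m) = n^{-1} \sum_{i=1}^{n} Q[g^{-1}\{m(T_i)\}, Y_i]$ for all $m \in \mathcal{M}_n$ when $\nu = 0$, thus $\frac{\partial}{\partial\nu} l_n(\hat m_\nu)|_{\nu=0} = 0$. For simplicity, denote $\hat m_i \equiv \hat m(T_i)$, and we have

$$0 \equiv \frac{\partial}{\partial\nu} l_n(\hat m_\nu)\Big|_{\nu=0} = n^{-1} \sum_{i=1}^{n} q_1(\hat m_i, Y_i) \tilde Z_i + O_P(N_n^{-p}). \tag{24}$$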

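For the first term in (24), we get

$$n^{-1} \sum_{i=1}^{n} q_1(\hat m_i, Y_i) \tilde Z_i = n^{-1} \sum_{i=1}^{n} q_1(m_{0i}, Y_i) \tilde Z_i + n^{-1} \sum_{i=1}^{n} q_2(m_{0i}, Y_i)(\hat m_i - m_{0i}) \tilde Z_i + n^{-1} \sum_{i=1}^{n} q_2'(\bar m_i, Y_i)(\hat m_i - m_{0i})^2 \tilde Z_i = \mathrm{I} + \mathrm{II} + \mathrm{III}. \tag{25}$$

We decompose II into two terms $\mathrm{II}_1$ and $\mathrm{II}_2$ as follows:

$$\mathrm{II} = n^{-1} \sum_{i=1}^{n} q_2(m_{0i}, Y_i) \tilde Z_i \{(\hat\eta - \eta_0)(X_i)\} + n^{-1} \sum_{i=1}^{n} q_2(m_{0i}, Y_i) \tilde Z_i Z_i^{T} (\hat\beta - \beta_0) = \mathrm{II}_1 + \mathrm{II}_2.$$

We next show that

$$\mathrm{II}_1 = \mathrm{II}_1^* + o_P(n^{-1/2}), \tag{26}$$

where $\mathrm{II}_1^* = -n^{-1} \sum_{i=1}^{n} \rho_2(m_{0i}) \tilde Z_i \{(\hat\eta - \eta_0)(X_i)\}$. Using an argument similar to the proof of Lemma 5, we have

$$(\hat\eta - \eta_0)(X_i) = B_i^{T} K V_n^{-1} \left\{ n^{-1} \sum_{i=1}^{n} q_1(m_{0i}, Y_i) D_i + o_P(N_n^{-p}) \right\},$$

where $K = (I_{N_n d_1}, 0_{(N_n d_1) \times d_2})$ and $I_{N_n d_1}$ is a diagonal matrix. Note that the expectation of the square of the $s$th column of $n^{1/2}(\mathrm{II}_1 - \mathrm{II}_1^*)$ is

$$E\left[ n^{-1/2} \sum_{i=1}^{n} \{q_2(m_{0i}, Y_i) + \rho_2(m_{0i})\} \tilde Z_{is} (\hat\eta - \eta_0)(X_i) \right]^2$$
$$= n^{-1} \sum_{i=1}^{n} \sum_{j=1}^{n} E\{\varepsilon_i \varepsilon_j \rho_1'(m_{0i}) \rho_1'(m_{0j}) \tilde Z_{is} \tilde Z_{js} (\hat\eta - \eta_0)(X_i)(\hat\eta - \eta_0)(X_j)\}$$
$$= n^{-3} \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} \sum_{l=1}^{n} E\{\varepsilon_i \varepsilon_j \varepsilon_k \varepsilon_l\, \rho_1'(m_{0i}) \rho_1'(m_{0j}) \rho_1(m_{0k}) \rho_1(m_{0l}) \cdots\} + o(n N_n^{-2p}) = o(1), \qquad s = 1, \ldots, d_2.$$

Thus, (26) holds by Markov's inequality. Based on (21), we have $\mathrm{II}_1^* = o_P(n^{-1/2})$. Using similar arguments and (20) and (21), we can show that

$$\mathrm{II}_2 = -n^{-1} \sum_{i=1}^{n} \rho_2(m_{0i}) \tilde Z_i Z_i^{T} (\hat\beta - \beta_0) + o_P(n^{-1/2}) = -n^{-1} \sum_{i=1}^{n} \rho_2(m_{0i}) \tilde Z_i^{\otimes 2} (\hat\beta - \beta_0) + o_P(n^{-1/2}).$$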

According to (23) and condition (C5), we have

$$\mathrm{III} = n^{-1} \sum_{i=1}^{n} q_2'(\bar m_i, Y_i)(\hat m_i - m_{0i})^2 \tilde Z_i \le C \|\hat m - m_0\|_\infty^2 = O_P\{N_n^{2(1-p)} + N_n^2 n^{-1}\} = o_P(n^{-1/2}).$$

Combining (24) and (25), we have

$$0 = n^{-1} \sum_{i=1}^{n} q_1(m_{0i}, Y_i) \tilde Z_i - \{E[\rho_2\{m_0(T)\} \tilde Z^{\otimes 2}] + o_P(1)\}(\hat\beta - \beta_0) + o_P(n^{-1/2}).$$

Note that

$$E[\rho_1^2\{m_0(T)\} \varepsilon^2 \tilde Z^{\otimes 2}] = E[E(\varepsilon^2 \mid T) \rho_1^2\{m_0(T)\} \tilde Z^{\otimes 2}] = E[\rho_2\{m_0(T)\} \tilde Z^{\otimes 2}].$$

Thus the desired distribution of $\hat\beta$ follows.
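The last equality in the preceding display can be unpacked in two lines. This is a sketch that assumes the usual quasi-likelihood notation, namely $\rho_l(m) = \{dg^{-1}(m)/dm\}^l / [\sigma^2 V\{g^{-1}(m)\}]$ and $\mathrm{var}(Y \mid T) = \sigma^2 V[g^{-1}\{m_0(T)\}]$, together with $\varepsilon = Y - g^{-1}\{m_0(T)\}$; if the main text defines these quantities differently, the steps need to be adapted accordingly.

```latex
\begin{aligned}
E(\varepsilon^2 \mid T) &= \sigma^2 V[g^{-1}\{m_0(T)\}], \\
\rho_1^2\{m_0(T)\}\, E(\varepsilon^2 \mid T)
  &= \frac{\bigl[\{dg^{-1}(m)/dm\}\big|_{m = m_0(T)}\bigr]^2}{\sigma^4 V^2[g^{-1}\{m_0(T)\}]}\;
     \sigma^2 V[g^{-1}\{m_0(T)\}]
   = \rho_2\{m_0(T)\}.
\end{aligned}
```

Multiplying by the outer-product factor in the display and taking expectations gives the stated identity, so the variance of the score term equals the curvature matrix that multiplies the estimation error of the linear parameter.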

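A.4. Proof of Theorem 3. Let $\tau_n = n^{-1/2} + a_n$. It suffices to show that for any given $\zeta > 0$, there exists a large constant $C$ such that

$$\operatorname{pr}\left\{ \sup_{\|u\| = C} \mathcal{L}_P(\beta_0 + \tau_n u) < \mathcal{L}_P(\beta_0) \right\} \ge 1 - \zeta. \tag{27}$$

Denote

$$U_{n,1} = \sum_{i=1}^{n} \bigl[ Q\{g^{-1}(\hat\eta^{\mathrm{MPL}}(X_i) + Z_i^{T}(\beta_0 + \tau_n u)), Y_i\} - Q\{g^{-1}(\hat\eta^{\mathrm{MPL}}(X_i) + Z_i^{T}\beta_0), Y_i\} \bigr]$$

and $U_{n,2} = -n \sum_{j=1}^{s} \{p_{\lambda_n}(|\beta_{j0} + \tau_n u_j|) - p_{\lambda_n}(|\beta_{j0}|)\}$, where $s$ is the number of components of $\beta_{10}$. Note that $p_{\lambda_n}(0) = 0$ and $p_{\lambda_n}(|\beta|) \ge 0$ for all $\beta$. Thus, $\mathcal{L}_P(\beta_0 + \tau_n u) - \mathcal{L}_P(\beta_0) \le U_{n,1} + U_{n,2}$. Let $\hat m_{0i}^{\mathrm{MPL}} = \hat\eta^{\mathrm{MPL}}(X_i) + Z_i^{T}\beta_0$. For $U_{n,1}$, note that

$$U_{n,1} = \sum_{i=1}^{n} \bigl[ Q\{g^{-1}(\hat m_{0i}^{\mathrm{MPL}} + \tau_n u^{T} Z_i), Y_i\} - Q\{g^{-1}(\hat m_{0i}^{\mathrm{MPL}}), Y_i\} \bigr].$$

Mimicking the proof for Theorem 2 indicates that

$$U_{n,1} = \tau_n u^{T} \sum_{i=1}^{n} q_1(m_{0i}, Y_i) \tilde Z_i + \frac{n}{2} \tau_n^2 u^{T} \Omega u + o_P(1), \tag{28}$$

where the orders of the first term and the second term are $O_P(n^{1/2}\tau_n)$ and $O_P(n\tau_n^2)$, respectively. For $U_{n,2}$, by a Taylor expansion and the Cauchy-Schwarz inequality, $n^{-1} U_{n,2}$ is bounded by $\sqrt{s}\,\tau_n a_n \|u\| + \tau_n^2 w_n \|u\|^2 = C\tau_n^2(\sqrt{s} + w_n C)$. As $w_n \to 0$, both the first and second terms on the right-hand side of (28) dominate $U_{n,2}$, by taking $C$ sufficiently large. Hence, (27) holds for sufficiently large $C$.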

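A.5. Proof of Theorem 4. The proof of $\hat\beta_2^{\mathrm{MPL}} = 0$ is similar to that of Lemma 3 in Li and Liang (2008). We therefore omit the details and refer to the proof of that lemma.

Let $\hat m^{\mathrm{MPL}}(x, z_1) = \hat\eta^{\mathrm{MPL}}(x) + z_1^{T}\hat\beta_1^{\mathrm{MPL}}$, for $\hat\eta^{\mathrm{MPL}}$ in (11), and $m_0(T_{1i}) = \eta_0(X_i) + Z_{1i}^{T}\beta_{10}$. Define $\mathcal{M}_{1n} = \{m(x, z_1) = \eta(x) + z_1^{T}\beta_1 : \eta \in G_n\}$. For any $\nu_1 \in R^{s}$, where $s$ is the dimension of $\beta_{10}$, define

$$\hat m_{\nu_1}^{\mathrm{MPL}}(t_1) = \hat m^{\mathrm{MPL}}(x, z_1) + \nu_1^{T}\tilde z_1 = \{\hat\eta^{\mathrm{MPL}}(x) - \nu_1^{T}\tilde\Gamma_1(x)\} + (\hat\beta_1^{\mathrm{MPL}} + \nu_1)^{T} z_1.$$

Note that $\hat m_{\nu_1}^{\mathrm{MPL}}$ maximizes $\sum_{i=1}^{n} Q[g^{-1}\{m(T_{1i})\}, Y_i] - n \sum_{j=1}^{s} p_{\lambda_n}(|\hat\beta_{j1}^{\mathrm{MPL}} + \nu_{j1}|)$ for all $m \in \mathcal{M}_{1n}$ when $\nu_1 = 0$. Mimicking the proof for Theorem 2 indicates that

$$0 = n^{-1} \sum_{i=1}^{n} q_1\{m_0(T_{1i}), Y_i\} \tilde Z_{1i} + \{p_{\lambda_n}'(|\beta_{j0}|)\operatorname{sign}(\beta_{j0})\}_{j=1}^{s} + o_P(n^{-1/2}) + \{E[\rho_2\{m_0(T_1)\} \tilde Z_1^{\otimes 2}] + o_P(1)\}(\hat\beta_1^{\mathrm{MPL}} - \beta_{10}) + \left\{\sum_{j=1}^{s} p_{\lambda_n}''(|\beta_{j0}|) + o_P(1)\right\}(\hat\beta_{j1}^{\mathrm{MPL}} - \beta_{j0}).$$

Thus, asymptotic normality follows because

$$0 = n^{-1} \sum_{i=1}^{n} q_1\{m_0(T_{1i}), Y_i\} \tilde Z_{1i} + \xi_n + o_P(n^{-1/2}) + \{\Omega_s + \Sigma_\lambda + o_P(1)\}(\hat\beta_1^{\mathrm{MPL}} - \beta_{10}),$$

$$E[\rho_1^2\{m_0(T_1)\}\{Y - m_0(T_1)\}^2 \tilde Z_1^{\otimes 2}] = E[\rho_2\{m_0(T_1)\} \tilde Z_1^{\otimes 2}].$$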
SUPPLEMENTARY MATERIAL

Detailed proofs and additional simulation results of: Estimation and variable
selection for generalized additive partial linear models
(DOI: 10.1214/11-AOS885SUPP; .pdf). The supplemental materials contain
detailed proofs and additional simulation results.